March 23, 2016
I recently gave a talk at GDC on using the survivor model to identify casual factors in churn. This post is to publish the code I used in that presentation and is useful for individuals who would like to practice creating a survivor model. Survivor models are popular for modeling single incidence events such as player churn. Another popular use of survivor model is to analyze a player's first MTX purchase.
Imagine a situation where your user acquisition team launched a new advertisement campaign. They used a provocative ad that generates a lot of new users. However, your game is free to play and you monetize steadily over the life of the player. You want to use churn as a proxy for quality, since if the players do not stay in your game long enough, it is impossible to monetize on them.
The code below, when pasted into Excel, will create some sample data for the first scenario. Please paste the formula in at least 1000 rows so you have enough data.
In [ ]:
Random Start End Churned DaysPlayed Amazon Google Other PurchasePrice FirstSessionLength
=RAND() =ROUND(RAND()*5,0) =B2+ROUND(RAND()*10,0)+1 =IF((((C2-B2)^-2)*A2)<0.02,1,0) =C2-B2 =IF(0.5+D2*0.25-RAND()^0.5<-0.15,1,0) =IF(F2=0,IF(0.5+D2*0.25-RAND()^0.5<0,1,0),0) =IF(G2+F2=0,1,0) =ROUND((0.25-D2*0.25+RAND()^0.3)*50,-1) =ROUND((0.5-D2*0.5+3*(RAND()^0.5))*50,0)
Since all numbers where generated without a set seed, I include the actual values used in this analysis below. In constructing this dataset, each row represents one player.
The next piece is to run the R code to generate the survivor model.
In [ ]:
# Read in your data
churn <- read.csv("path_to_your_data/Ad Data.csv")
# Attaching allows you to refer to your data with shorthand
attach(churn)
# Initialize the survival library. This steps assumes you have already installed the library
library("survival")
# Please see the survival library documentation for a full reference of
# all the survival functions.
#Here we calculate the incident of churn each day
# from the number of players that are still active
churn.surv <- survfit(Surv(DaysPlayed, Churned)~ 1, conf.type="none")
summary(churn.surv)
# Plotting survival probability
plot(churn.surv, xlab="DaysPlayed", ylab="Survival Probability")
# The first churn model uses the exponential distribution
churn.survexp <- survreg(Surv(DaysPlayed, Churned) ~
Amazon+Google+FirstSessionLength+PurchasePrice, dist="exponential")
summary(churn.survexp)
# The second churn model uses the Weibull distribution
churn.survWeibull <- survreg(Surv(DaysPlayed, Churned) ~
Amazon+Google+FirstSessionLength+PurchasePrice, dist="weibull")
summary(churn.survWeibull)
library(lmtest)
# The test results show that the models are stastically different
# and the Weibull fits the data better
lrtest(churn.survexp,churn.survWeibull)
# clean up
detach(churn)
Here is the condensed output from the above code:
In [ ]:
survfit(formula = Surv(DaysPlayed, Churned) ~ 1, conf.type = "none")
time n.risk n.event survival std.err
1 104856 98 0.999 9.44e-05
2 99529 846 0.991 3.05e-04
3 89208 1854 0.970 5.60e-04
4 78741 3382 0.928 8.82e-04
5 68210 5222 0.857 1.25e-03
6 57770 7564 0.745 1.62e-03
7 47249 10480 0.580 1.90e-03
8 36573 10386 0.415 1.93e-03
9 26187 10372 0.251 1.71e-03
10 15815 10514 0.084 1.10e-03
11 5301 5301 0.000 NaN
survreg(formula = Surv(DaysPlayed, Churned) ~ Amazon + Google +
FirstSessionLength + PurchasePrice, dist = "exponential")
Value Std. Error z p
(Intercept) 1.18025 0.017063 69.2 0.00e+00
Amazon 0.40787 0.010726 38.0 0.00e+00
Google 0.11250 0.008739 12.9 6.32e-38
FirstSessionLength 0.00272 0.000105 25.9 1.44e-147
PurchasePrice 0.01661 0.000351 47.3 0.00e+00
Scale fixed at 1
Exponential distribution
Loglik(model)= -212158.6 Loglik(intercept only)= -214884.3
Chisq= 5451.41 on 4 degrees of freedom, p= 0
Number of Newton-Raphson Iterations: 4
n= 104856
survreg(formula = Surv(DaysPlayed, Churned) ~ Amazon + Google +
FirstSessionLength + PurchasePrice, dist = "weibull")
Value Std. Error z p
(Intercept) 2.098805 4.50e-03 466.85 0.00e+00
Amazon 0.012933 2.48e-03 5.21 1.84e-07
Google 0.003131 2.03e-03 1.54 1.24e-01
FirstSessionLength 0.000143 2.54e-05 5.62 1.95e-08
PurchasePrice 0.000690 9.01e-05 7.65 1.96e-14
Log(scale) -1.456626 3.02e-03 -482.26 0.00e+00
Scale= 0.233
Weibull distribution
Loglik(model)= -146039.7 Loglik(intercept only)= -146101
Chisq= 122.62 on 4 degrees of freedom, p= 0
Number of Newton-Raphson Iterations: 15
n= 104856
lrtest(churn.survexp,churn.survWeibull)
Likelihood ratio test
Model 1: Surv(DaysPlayed, Churned) ~ Amazon + Google + FirstSessionLength +
PurchasePrice
Model 2: Surv(DaysPlayed, Churned) ~ Amazon + Google + FirstSessionLength +
PurchasePrice
#Df LogLik Df Chisq Pr(>Chisq)
1 5 -212159
2 6 -146040 1 132238 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
By comparing the two models, we can see that they are statistically different and the Weibull distribution fits the data better. Now that we have identified our game data as having a Weibull distribution in churn, let us analyze the effect of the new ad.
In [ ]:
# Read in the ad data
cyborg <- read.csv("path_to_your_data//Churn Data with Cyborg Campaign.csv")
attach(cyborg)
library("survival")
# Run the Weibull model again, but with the added CyborgAd variable
cyborg.survWeibull <- survreg(Surv(DaysPlayed, Churned) ~
CyborgAd+ Amazon+Google+FirstSessionLength+PurchasePrice, dist="weibull")
summary(cyborg.survWeibull)
detach(cyborg)
With the following condensed output:
In [ ]:
survreg(formula = Surv(DaysPlayed, Churned) ~ CyborgAd + Amazon +
Google + FirstSessionLength + PurchasePrice, dist = "weibull")
Value Std. Error z p
(Intercept) 2.098137 5.33e-03 393.98 0.00e+00
CyborgAd -0.095127 2.12e-03 -44.81 0.00e+00
Amazon 0.012497 2.85e-03 4.39 1.12e-05
Google -0.002237 2.33e-03 -0.96 3.37e-01
FirstSessionLength 0.000121 2.94e-05 4.11 3.91e-05
PurchasePrice 0.000502 1.03e-04 4.88 1.08e-06
Log(scale) -1.233473 2.82e-03 -437.16 0.00e+00
Scale= 0.291
Weibull distribution
Loglik(model)= -180633.5 Loglik(intercept only)= -181713.9
Chisq= 2160.92 on 5 degrees of freedom, p= 0
Number of Newton-Raphson Iterations: 11
n= 104856
Here, negative coefficients indicate an increased likelihood to churn. The results show that although the Cyborg ad had a solid acquisition rate, the players did not retain as well. To calculate whether the campaign itself was profitable, you would have to look at your monetization of players with the shorter duration of play and compare it to your acquisition costs.
The premise of the second scenario is that the game is patched. The patch makes these changes:
Unbeknownst to the developers, the patch also includes a critical bug that only affects wizards.
Here is the SQL code used to generate the random data. The SQL code uses window analytic functions and should run in systems using PostgreSQL style syntax.
In [ ]:
/* A rough order of operations for how the code runs:
Create a table of random variables.
Apply player specific qualities based on those stats, such as which class they choose.
Create daily specific variables, including some additional
random variables such as whether they had a bug that day.
Calculate when a player churns. Since this is fake data, a player can "churn" multiple times.
To correct the fake data, anyone marked as churned has all data truncated after their first
churn flag using window functions.
*/
drop table if exists schema_name_here.churn_test_model;
-- create a permanent table
create table schema_name_here.churn_test_model as (
with random_sequence as (
-- create a set of random numbers to build the data set
select setseed(0) as null_seed, null as quit_seed, null as start_date_seed
, null as class_seed, null as patch_seed
union all
select null
-- approximate a random normal variable
, random() + random() + random() + random() + random() + random() + random()
+ random() + random() + random() + random() + random()-6
, random()
, random()
, random()
from generate_series(1, 100000)
offset 1)
, seed_values as
-- create the player specific data
(select *
, trunc(start_date_seed*8) as start_date
, case when start_date_seed*8<3 and class_seed <.6 then 1 -- warrior, pre patch
when start_date_seed*8<3 then 2 -- wizard, pre patch
when class_seed<.6 and patch_seed<.5 then 1 -- warrior, no patch
when patch_seed<.5 then 2 -- wizard, no patch
when class_seed <.8 then 1 -- warrior, patch
else 2 -- wizard, patch
end as class
, case when patch_seed>=.5 then 1
else 0 end as patch_group
from random_sequence
)
, random_day as
-- start creating the daily variables
(select *
, start_date+age_of_account as start
, start_date+age_of_account+1 as stop
, random() as bug_seed
, random()+random()*random()-3 minutes_played_seed
, case when patch_seed>=.5 and start_date+age_of_account>3 then 1
else 0 end as patch
-- make the daily calculations have noise
, (random() + random() + random() + random() + random() + random()
+ random() + random() + random() +
random() + random() + random()-6)/4 as random_noise
from seed_values
cross join
(select row_number() over() as age_of_account
from generate_series(1, 10)) as study_days
)
, day_event as
-- add in daily bugs
(select *
-- bugs vary depending on patch and class
, case when class=2 then 1 else 0 end as is_wizard
, case when bug_seed>.9 and patch=0 then 1
when bug_seed>.85 and patch=1 and class=2 then 1
when bug_seed>.95 and patch=1 and class=1 then 1
else 0 end as bug
-- time played varies on bug chance
, case when bug_seed>.9 and patch=0 then @minutes_played_seed*15
when bug_seed>.85 and patch=1 and class=2 then @minutes_played_seed*15
when bug_seed>.95 and patch=1 and class=1 then @minutes_played_seed*15
else @minutes_played_seed*20 end as minutes_played
from random_day
)
, input_variable as
-- calculate when the player churns
(select *
, -log(age_of_account+1)*0.5+ln(minutes_played)/20
+(@quit_seed)/4+patch*0.025+random_noise/6
-bug*0.12-is_wizard*patch*0.04 as test
, case when -log(age_of_account+1)*0.5+ln(minutes_played)/20
+(@quit_seed)/4+patch*0.025
+random_noise/6-bug*0.12-is_wizard*patch*0.04<0
then 1 else 0 end as churn
from day_event
)
, order_data as
-- ordering the data by individual player churn
(select start, stop, age_of_account, patch, bug, minutes_played, quit_seed
, patch_group, is_wizard, test, bug_seed
, max(churn) over
(partition by quit_seed order by quit_seed
, age_of_account desc range between current row and unbounded following) as churn
from input_variable
)
, limiting as
-- select the first time a player churns
(select *
, lag(churn) over (partition by quit_seed order by age_of_account) as lag_churn
from order_data
)
select start, stop, age_of_account, patch, bug, minutes_played, is_wizard, churn
, patch_group, quit_seed, test, bug_seed
from limiting
where (lag_churn<>1 or lag_churn is null) and stop<=10
)
This code generates a random set of data for our model. The actual data used can be at the bottom of this post. Notice that each day of data is its own row, and the quit_seed variable actually functions as the player id. The data is organized similar to a panel dataset.
Here is the R code used to run the model:
In [ ]:
library(RODBC)
myconn<-odbcConnect('This_is_your_odbc_connection')
patch_data = sqlQuery(myconn,"select * from schema_name_here.churn_test_model")
attach(patch_data)
summary(coxph(Surv(start, stop, churn) ~
bug+is_wizard+patch+age_of_account+minutes_played+cluster(quit_seed), data=patch_data))
The results of the code can be found here:
In [ ]:
coxph(formula = Surv(start, stop, churn) ~ bug + is_wizard +
patch + age_of_account + minutes_played + cluster(quit_seed),
data = patch_data)
n= 382440, number of events= 59227
coef exp(coef) se(coef) robust se z Pr(>|z|)
bug 1.3921964 4.0236780 0.0112569 0.0110634 125.838 <2e-16 ***
is_wizard 0.0830510 1.0865972 0.0086099 0.0092643 8.965 <2e-16 ***
patch -0.0828029 0.9205325 0.0089436 0.0094945 -8.721 <2e-16 ***
age_of_account 0.1582036 1.1714047 0.0023148 0.0024072 65.720 <2e-16 ***
minutes_played -0.0196435 0.9805482 0.0005978 0.0005983 -32.832 <2e-16 ***
Here, positive coefficients indicate a higher likelihood to churn. The initial results show that the patch reduces churn. However, in the forums, customer service reports an increased number of complaints from wizards. Forums are notoriously hard to extrapolate to the entire player community; so instead, let us use the survivor model to see what is happening with wizards and the patch by creating an interaction variable.
In [ ]:
patch_data$wizard_bug <- is_wizard*bug
summary(coxph(Surv(start, stop, churn) ~
bug+is_wizard+patch+age_of_account+minutes_played+
wizard_patch+cluster(quit_seed), data=patch_data))
detach(patch_data)
The formula stays the same except for the addition of the new variable, wizard_patch.
In [ ]:
coxph(formula = Surv(start, stop, churn) ~ bug + is_wizard +
patch + age_of_account + minutes_played + wizard_patch +
cluster(quit_seed), data = patch_data)
n= 382440, number of events= 59227
coef exp(coef) se(coef) robust se z Pr(>|z|)
bug 1.3777800 3.9660871 0.0113080 0.0111103 124.009 <2e-16 ***
is_wizard -0.0054967 0.9945184 0.0109240 0.0115514 -0.476 0.634
patch -0.1666453 0.8464998 0.0109573 0.0112832 -14.769 <2e-16 ***
age_of_account 0.1560963 1.1689388 0.0023195 0.0024141 64.659 <2e-16 ***
minutes_played -0.0196420 0.9805496 0.0005978 0.0005983 -32.830 <2e-16 ***
wizard_patch 0.2353473 1.2653481 0.0176368 0.0188478 12.487 <2e-16 ***
When you separate out wizards who receive the patch, you find that they had increased likelihood to churn due to the increased incidence in bugs.
While the examples used in this blog are trivial, they are useful for showing how the survival model can fit single incidence events with non-normal distributions such as churn quite well. Feeding the data into the model is straightforward; and, in the modelling stage, you have a lot of flexibility in what variables you include. Ideally, you would have some sort of experiment running to give you clear causation, but even correlations can help you understand how different factors relate to each other.